Introduction to Statistics

Bennett Kleinberg

Week 2

  • Central tendency
  • Variability of data

Note on sampling

Sampling

  • sampling is the process by which \(n\) observations are taken from a population of size \(N\)
  • this is one of the most important methods in the behavioural and social sciences
  • if the sampling is wrong, the rest is BS
  • GIGO principle (garbage in, garbage out)
  • more in week 4
  • for now: sample = subset of the population

Part 1: Central tendency

  • Aim: we want to describe the data
  • specifically: we want to express the center of the data distribution
  • remember: think of data = distribution

Example data

  • We take a sample of \(n=100\) students at TiU
  • And ask: how many hours per week do you spend on YouTube?
  • Answers in full hours
hours_YT
8
11
7
15
11
8

Looking at the histogram

Describing central tendency

The MODE:

  • simple definition: the score (or category) with the highest frequency
  • works for all scales of data (think about nominal data)

Obtaining the mode

We look at the frequency table, and select the most frequently chosen option:

hours Freq
10 17
12 16
11 15
8 12
9 9

The mode is 10 hours.
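As a minimal sketch of this lookup in Python: the dictionary `freq` below is an illustrative container holding only the five rows shown in the table above (not the full sample of 100), and the mode is the hours value with the largest count.

```python
# Frequency table from the slide: hours -> number of students.
# Only the five rows shown above are included (assumption: no
# unseen row has a higher count).
freq = {10: 17, 12: 16, 11: 15, 8: 12, 9: 9}

# The mode is the key whose count is largest.
mode = max(freq, key=freq.get)
print(mode)  # 10
```

With raw scores instead of a frequency table, the standard library function `statistics.mode` would do the counting for us.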

The location of the mode

The mode and distribution shapes

(demo)

Describing central tendency

The MEAN:

  • often called the average
  • exact definition: the sum of all scores divided by the number of scores

Statistical notation:

\(\mu=\frac{\sum{X}}{N}\) (population mean)

\(M=\frac{\sum{X}}{n}\) (sample mean)

Calculating the mean

  • Sample size: \(n=5\)
  • YouTube hours watched data: \(5,7,9,14,6\)

\(\sum{X} = 5+7+9+14+6 = 41\)

\(M=\frac{\sum{X}}{n} = \frac{41}{5} = 8.20\)
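The same calculation as a short Python sketch (the variable names are illustrative):

```python
# YouTube hours for the sample of n = 5
hours = [5, 7, 9, 14, 6]

total = sum(hours)   # sum of all scores: 41
n = len(hours)       # number of scores: 5
mean = total / n     # sample mean M

print(mean)  # 8.2
```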

Where is it in the distribution?

Mode and mean

Why not always the mean?

Suppose there are 10 friends (a, b, c, … j) in a bar. Each of them says how many hours they spent on YouTube last week.

Here’s their data:

name hours
a 15
b 6
c 2
d 2
e 4
f 12
g 6
h 15
i 3
j 7

Now another person enters. This friend, “k”, is a binge watcher. He says that last week he watched 50 hours of YouTube.

What do you think will happen to the mean?

New histogram

Beware of outliers

  • Mean before: \(M=\frac{\sum{X}}{n} = \frac{72}{10} = 7.20\)
  • Mean with binge-watcher: \(M=\frac{\sum{X}}{n} = \frac{122}{11} = 11.09\)

Extreme values can affect the mean!

The extreme values are often called outliers.
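A small Python sketch of the effect, using the friend data from the table above. The helper `mean` is written here for illustration only.

```python
# Hours watched by the ten friends a..j
friends = {"a": 15, "b": 6, "c": 2, "d": 2, "e": 4,
           "f": 12, "g": 6, "h": 15, "i": 3, "j": 7}

def mean(values):
    """Sum of all scores divided by the number of scores."""
    values = list(values)
    return sum(values) / len(values)

mean_before = mean(friends.values())   # 72 / 10

# The binge watcher "k" enters the bar
friends["k"] = 50
mean_after = mean(friends.values())    # 122 / 11

print(round(mean_before, 2), round(mean_after, 2))  # 7.2 11.09
```

One extreme value moves the mean by almost four hours, although ten of the eleven scores did not change.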

Another illustration

There are a hundred people in a bar. The average (mean) income is 30,000 EUR. Now Jeff Bezos walks in and suddenly everyone is, on average, a billionaire.

These problems can be addressed:

  • mean trimming (not in this course)
  • another metric

Describing central tendency

The MEDIAN:

  • often called the midpoint
  • exact definition: the median splits the distribution in half

Example

The friend data:

name hours
a 15
b 6
c 2
d 2
e 4
f 12
g 6
h 15
i 3
j 7
k 50

Obtaining the median

  1. sort the data
x
2
2
3
4
6
6
7
12
15
15
50

Obtaining the median

  2. find the value that lies in the middle

Here: We know we have 11 values, so the 6th value has 5 points to its left and 5 to its right. The 6th value in the sorted data is 6, so the median is 6 hours.
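The two steps (sort, then pick the middle value) can be sketched in Python as follows. This version assumes an odd number of values, as in the friend data including "k".

```python
# Friend data including the binge watcher "k"
hours = [15, 6, 2, 2, 4, 12, 6, 15, 3, 7, 50]

# Step 1: sort the data
sorted_hours = sorted(hours)

# Step 2: pick the middle value. With 11 values the middle is the
# 6th one, which is index 5 because Python counts from 0.
n = len(sorted_hours)
middle_index = n // 2
median = sorted_hours[middle_index]

print(median)  # 6
```

The binge watcher's 50 hours sit at the far end of the sorted list and do not move the middle value.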

Mean and median

Special cases

Distributions without a “clear” midpoint:

  • data: \(4,15,13,14,38,3\)
  • sorted data: \(3,4,13,14,15,38\)
  • median?

In this case, we take the two middle values and obtain the average:

  • median = \(\frac{13+14}{2}=13.5\)
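A sketch of a median function that covers both cases: an odd number of values (one middle value) and an even number (average of the two middle values). The function name `median` is chosen for illustration; Python's standard library offers `statistics.median`, which follows the same rule.

```python
def median(values):
    """Middle value of the sorted data; for an even number of
    values, the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd n: exactly one middle value
        return ordered[mid]
    # even n: average the two values around the midpoint
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [4, 15, 13, 14, 38, 3]
print(median(data))  # 13.5
```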

Part 2: Variability

  • Aim: we want to describe the data
  • specifically: we want to express how much the scores in the data differ
  • also called the spread of the data (or lack thereof)

New data example

  • grades for Intro to Statistics at first attempt for \(N=10\) students
id grade
A K 5
B L 3
C M 6
D N 6
E O 7
F P 8
G Q 6
H R 9
I S 8
J T 10

How can we express data variability?

  • The easiest way: we take the lowest value and the highest value
  • \(\min grade = 3\)
  • \(\max grade = 10\)

\(range = \max - \min = 10 - 3 = 7\)
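In Python, the range of the ten grades could be computed like this (a minimal sketch; `grade_range` is an illustrative name, chosen to avoid shadowing Python's built-in `range`):

```python
# The ten grades from the table above
grades = [5, 3, 6, 6, 7, 8, 6, 9, 8, 10]

# range = highest value minus lowest value
grade_range = max(grades) - min(grades)

print(grade_range)  # 7
```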

See also p. 102 in the book.

A bit more nuanced

  • maybe we calculate how much each score differs from the (population) mean
  • \(\mu = 6.8\)
id grade dist_to_mean
A K 5 -1.8
B L 3 -3.8
C M 6 -0.8
D N 6 -0.8
E O 7 0.2
F P 8 1.2
G Q 6 -0.8
H R 9 2.2
I S 8 1.2
J T 10 3.2

What is problematic?

This procedure gives us the deviation score (from the mean) for each value

\(deviation = X - \mu\)

  • Think about what the mean actually is
  • It is - by definition - the balancing point
  • Have a look…

Deviation and the mean

Deviations sum to 0
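A quick numerical check in Python, using the ten grades. Because computers store decimals with tiny rounding errors, the sketch tests whether the sum is close to zero instead of exactly zero; the tolerance `1e-9` is an arbitrary choice for illustration.

```python
import math

grades = [5, 3, 6, 6, 7, 8, 6, 9, 8, 10]

# Population mean: 68 / 10 = 6.8
mu = sum(grades) / len(grades)

# Deviation score for each grade: X - mu
deviations = [x - mu for x in grades]

# Negative and positive deviations cancel each other out
total_deviation = sum(deviations)
print(math.isclose(total_deviation, 0.0, abs_tol=1e-9))  # True
```

This is why the plain sum (or mean) of the deviations cannot serve as a measure of variability: it is zero for every data set.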

Common trick: Squaring the difference

id grade dist_to_mean sq_dev
A K 5 -1.8 3.24
B L 3 -3.8 14.44
C M 6 -0.8 0.64
D N 6 -0.8 0.64
E O 7 0.2 0.04
F P 8 1.2 1.44
G Q 6 -0.8 0.64
H R 9 2.2 4.84
I S 8 1.2 1.44
J T 10 3.2 10.24

The \(x^2\) trick

  • removes negative values
  • “punishes” larger values
  • \(2^2 = 4\)
  • \(4^2 = 16\)
  • Note: differences are also squared
  • When we double \(x\), we quadruple \(x^2\)

From deviation to variance

We can obtain a more meaningful measure now.

The mean of squared deviations is called the variance.

\(var = \frac{\sum{(X-\mu)^2}}{N}\)

Stepwise: deviation

Using only the first five grades (\(N=5\)): \(\mu = \frac{27}{5} = 5.4\)

id grade dev
A K 5 -0.4
B L 3 -2.4
C M 6 0.6
D N 6 0.6
E O 7 1.6

Stepwise: squared deviation

id grade dev sq_dev
A K 5 -0.4 0.16
B L 3 -2.4 5.76
C M 6 0.6 0.36
D N 6 0.6 0.36
E O 7 1.6 2.56

\(var = \frac{\sum{(X-\mu)^2}}{N} = \frac{9.2}{5} = 1.84\)
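The stepwise calculation as a Python sketch, treating the five grades as the whole population. Results are rounded for printing because of small floating-point errors.

```python
# The five grades used in the stepwise example
grades = [5, 3, 6, 6, 7]
N = len(grades)

# Step 1: the mean
mu = sum(grades) / N                        # 27 / 5 = 5.4

# Step 2: deviation of each score from the mean
deviations = [x - mu for x in grades]       # -0.4, -2.4, 0.6, 0.6, 1.6

# Step 3: squared deviations
squared = [d ** 2 for d in deviations]      # 0.16, 5.76, 0.36, 0.36, 2.56

# Step 4: the mean of the squared deviations is the variance
ss = sum(squared)                           # about 9.2
variance = ss / N

print(round(mu, 2), round(ss, 2), round(variance, 2))  # 5.4 9.2 1.84
```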

Stepwise: the standard deviation

  • among the most frequently used statistics for variability
  • standard in most research papers

\(SD = \sqrt{var}\)

\(\sigma = \sqrt{\frac{\sum{(X-\mu)^2}}{N}}\)

Here: \(\sigma = \sqrt{\frac{9.2}{5}} = \sqrt{1.84} = 1.36\)
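Continuing the same example in Python: the standard deviation is the square root of the variance. A minimal sketch with illustrative variable names:

```python
import math

grades = [5, 3, 6, 6, 7]
N = len(grades)
mu = sum(grades) / N

# Population variance: mean of the squared deviations
variance = sum((x - mu) ** 2 for x in grades) / N

# Population standard deviation: square root of the variance
sigma = math.sqrt(variance)

print(round(sigma, 2))  # 1.36
```

Taking the square root brings the measure back to the original unit (grade points), which makes it easier to interpret than the variance.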

Sum of squares

  • an alternative approach is to first go through the sum of squared deviations (SS)
  • this: \(\sum{(X-\mu)^2}\)

Then:

\(var = \frac{SS}{N}\)

\(\sigma = \sqrt{\frac{SS}{N}}\)

This is why \(var\) is also denoted as \(\sigma^2\)

Remember populations and samples?

Up to here: the variability statistics were for the population

Variability computed from a sample is biased (i.e. it systematically over- or underestimates the population value):

  • here this means it will underestimate the variability of the population
  • we can correct for this
  • this is where we need the sum of squares

Correcting for bias

We make the value slightly larger, by decreasing the denominator:

\(sample\ variance = \frac{SS}{n-1}\)

\(s = \sqrt{\frac{SS}{n-1}}\)

Compare:

  • \(\frac{SS}{N} = \frac{9.2}{5} = 1.84\) vs \(\frac{SS}{n-1} = \frac{9.2}{4} = 2.30\)
  • \(\sqrt{\frac{9.2}{5}} = 1.36\) vs \(\sqrt{\frac{9.2}{4}} = 1.52\)
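Python's standard `statistics` module provides both versions, which makes the comparison easy to reproduce. In this sketch, `pvariance` and `pstdev` divide by \(N\) (population), while `variance` and `stdev` divide by \(n-1\) (sample).

```python
import statistics

grades = [5, 3, 6, 6, 7]

# Population formulas: SS / N
pop_var = statistics.pvariance(grades)
pop_sd = statistics.pstdev(grades)

# Sample formulas: SS / (n - 1)
sample_var = statistics.variance(grades)
sample_sd = statistics.stdev(grades)

print(round(pop_var, 2), round(sample_var, 2))  # 1.84 2.3
print(round(pop_sd, 2), round(sample_sd, 2))    # 1.36 1.52
```

When using other software, it is worth checking which denominator a given function uses by default, since this differs between tools.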

Examples of reporting summary statistics

show that the judgments are closer to the true emotion score in the longer texts (M=1.19, SD=1.88) than in the shorter ones (M=2.00, SD=2.35), Cohen’s d = 0.38 [99% CI: 0.30; 0.45]

Examples of reporting summary statistics

The temporal evolution of a far‑right forum

Recap

  • we can describe the center of the data
    • mode
    • mean
    • median
  • we can also describe how much the data is spread out
    • range
    • deviation -> variance -> standard deviation
    • correcting for sample bias in sample statistics

Next week

  • probability
  • z-scores